Variable Selection Techniques for Clustering on the Unit Hypersphere
Mixtures of von Mises-Fisher distributions have been shown to be an effective model for clustering data on a unit hypersphere, but variable selection for these models remains an important and challenging problem. In this paper, we derive two variants of the expectation-maximization framework, each of which is used to identify a specific type of irrelevant variable for these models. The first type consists of noise variables, which are not useful for separating any pairs of clusters. The second type consists of redundant variables, which may be useful for separating pairs of clusters, but do not enable any additional separation beyond the separability provided by some other variables. Removing these irrelevant variables is shown to improve cluster quality in simulated as well as benchmark datasets.
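For orientation, the parameter update for a single von Mises-Fisher component can be sketched as follows. This is not the paper's variable selection procedure: the function name `fit_vmf_component` is hypothetical, and the concentration estimate uses a commonly cited closed-form approximation (often attributed to Banerjee et al.) rather than anything stated in the abstract.

```python
import math


def fit_vmf_component(points):
    """Estimate the mean direction and concentration of one vMF component.

    `points` is a sequence of unit-length vectors of equal dimension.
    The concentration uses the approximation
    kappa ~ r_bar * (d - r_bar^2) / (1 - r_bar^2), which is an assumption
    of this sketch and not taken from the paper.
    """
    n = len(points)
    d = len(points[0])
    # Sum the unit vectors coordinate by coordinate.
    resultant = [sum(p[j] for p in points) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in resultant))
    if norm == 0.0:
        raise ValueError("resultant vector is zero; mean direction is undefined")
    mean_direction = [x / norm for x in resultant]
    # Mean resultant length lies in (0, 1]; values near 1 mean tight clusters.
    r_bar = norm / n
    if r_bar >= 1.0:
        # All points coincide, so the concentration is unbounded.
        return mean_direction, math.inf
    kappa = r_bar * (d - r_bar ** 2) / (1.0 - r_bar ** 2)
    return mean_direction, kappa
```

In a mixture model, an update of this kind would be applied per cluster with points weighted by their posterior membership probabilities; the unweighted version above only illustrates the single-component case.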
Confidence Intervals for Prevalence Estimates from Complex Surveys with Imperfect Assays
We present several related methods for creating confidence intervals to
assess disease prevalence in a variety of survey sampling settings. These include
simple random samples with imperfect tests, weighted sampling with perfect
tests, and weighted sampling with imperfect tests, with the first two settings
considered special cases of the third. Our methods use survey results and
measurements of test sensitivity and specificity to construct melded confidence
intervals. We demonstrate that our methods appear to guarantee coverage in
simulated settings, while competing methods are shown to achieve much lower
than nominal coverage. We apply our method to a seroprevalence survey of
SARS-CoV-2 in undiagnosed adults in the United States between May and July
2020.
Comment: 45 pages, 35 figures
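As background for how sensitivity and specificity enter a prevalence estimate, the classical Rogan-Gladen point correction can be sketched as below. This is only the standard point estimate for a simple random sample, not the melded confidence interval method of the paper; the function name and the clipping to [0, 1] are assumptions made for illustration.

```python
def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    """Correct an apparent (test-positive) prevalence for an imperfect assay.

    Uses the Rogan-Gladen formula
        (apparent + specificity - 1) / (sensitivity + specificity - 1),
    clipped to the interval [0, 1]. Illustrative only; it produces a point
    estimate and says nothing about interval coverage.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0.0:
        raise ValueError("test must be informative: sensitivity + specificity > 1")
    adjusted = (apparent_prevalence + specificity - 1.0) / denominator
    # Sampling noise can push the raw estimate outside the valid range.
    return min(1.0, max(0.0, adjusted))
```

When the apparent prevalence is below the false-positive rate (1 - specificity), the raw estimate is negative and is clipped to zero, which is one reason interval methods for this setting need special care.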
Semi-parametric modeling of SARS-CoV-2 transmission in Orange County, California using tests, cases, deaths, and seroprevalence data
Mechanistic modeling of SARS-CoV-2 transmission dynamics and frequently
estimating model parameters using streaming surveillance data are important
components of the pandemic response toolbox. However, transmission model
parameter estimation can be imprecise, and sometimes even impossible, because
surveillance data are noisy and not informative about all aspects of the
mechanistic model. To partially overcome this obstacle, we propose a Bayesian
modeling framework that integrates multiple surveillance data streams. Our
model uses both SARS-CoV-2 diagnostic test and mortality time series to
estimate our model parameters, while also explicitly integrating seroprevalence
data from cross-sectional studies. Importantly, our data generating model for
incidence data takes into account changes in the total number of tests
performed. We model transmission rate, infection-to-fatality ratio, and a
parameter controlling a functional relationship between the true case incidence
and the fraction of positive tests as time-varying quantities and estimate
changes of these parameters nonparametrically. We apply our Bayesian data
integration method to COVID-19 surveillance data collected in Orange County,
California between March, 2020 and March, 2021 and find that 33-62% of the
Orange County residents experienced SARS-CoV-2 infection by the end of
February, 2021. Despite this high number of infections, our results show that
the abrupt end of the winter surge in January, 2021, was due to both behavioral
changes and a high level of accumulated natural immunity.
Comment: 37 pages, 16 pages of main text, including 5 figures, 1 table
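To illustrate what a mechanistic transmission model with a time-varying transmission rate looks like, here is a minimal deterministic SIR sketch with a daily step. It is not the semi-parametric Bayesian model described above: the compartment structure, the daily Euler update, and all names (`simulate_sir`, `beta_by_day`, `gamma`) are assumptions chosen for illustration, and no observation model for tests, deaths, or seroprevalence is included.

```python
def simulate_sir(beta_by_day, gamma, population, initial_infected):
    """Simulate a deterministic SIR model with a time-varying transmission rate.

    beta_by_day: sequence of daily transmission rates (hypothetical input).
    gamma: daily recovery rate.
    Returns a list of (susceptible, infected, recovered) tuples, one per day,
    starting with the initial state.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for beta in beta_by_day:
        # New infections depend on contact between susceptible and infected.
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history
```

Lowering the entries of `beta_by_day` mimics behavioral change, while depletion of the susceptible compartment mimics accumulated immunity; both slow growth in the infected compartment, which is the kind of distinction the abstract's conclusion about the January 2021 surge rests on.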
Lectin-Dependent Enhancement of Ebola Virus Infection via Soluble and Transmembrane C-type Lectin Receptors
Mannose-binding lectin (MBL) is a key soluble effector of the innate immune system that recognizes pathogen-specific surface glycans. Surprisingly, low-producing MBL genetic variants that may predispose children and immunocompromised individuals to infectious diseases are more common than would be expected in human populations. Since certain immune defense molecules, such as immunoglobulins, can be exploited by invasive pathogens, we hypothesized that MBL might also enhance infections in some circumstances. Consequently, the low and intermediate MBL levels commonly found in human populations might be the result of balancing selection. Using model infection systems with pseudotyped and authentic glycosylated viruses, we demonstrated that MBL indeed enhances infection of Ebola, Hendra, Nipah and West Nile viruses in low complement conditions. Mechanistic studies with Ebola virus (EBOV) glycoprotein pseudotyped lentiviruses confirmed that MBL binds to N-linked glycan epitopes on viral surfaces in a specific manner via the MBL carbohydrate recognition domain, which is necessary for enhanced infection. MBL mediates lipid-raft-dependent macropinocytosis of EBOV via a pathway that appears to require less actin or early endosomal processing compared with the filovirus canonical endocytic pathway. Using a validated RNA interference screen, we identified C1QBP (gC1qR) as a candidate surface receptor that mediates MBL-dependent enhancement of EBOV infection. We also identified dectin-2 (CLEC6A) as a potentially novel candidate attachment factor for EBOV. Our findings support the concept of an innate immune haplotype that represents critical interactions between MBL and complement component C4 genes and that may modify susceptibility or resistance to certain glycosylated pathogens. 
Therefore, higher levels of native or exogenous MBL could be deleterious in the setting of relative hypocomplementemia, which can occur genetically or because of immunodepletion during active infections. Our findings confirm our hypothesis that the pressure of infectious diseases may have contributed in part to evolutionary selection of MBL mutant haplotypes.
Inference and Forecasting Using Infectious Disease Surveillance Data
Statistical modeling of infectious disease data is among the oldest applications of statistics. Today, it is an increasingly relevant area of research, due to globalization that enables diseases to spread further and faster, as well as the abundance of relevant data from electronic surveillance systems, seroprevalence studies, and genetic sequencing of pathogens. In this work, we develop novel statistical methods to combine varied data sources to improve both inference and forecasting. First, we work with data from assay validation studies and active surveillance studies to develop confidence intervals for prevalence estimates from complex surveys with imperfect assays. In this complicated setting, there are no established competitive methods, and ours exhibits at least nominal coverage. In addition, we apply our model in simplified cases where competitors exist and demonstrate desirable properties. Next, we develop a semi-parametric Bayesian compartmental model that effectively integrates passively collected time series of diagnostic tests and mortality data, as well as actively collected seroprevalence data. We emphasize retrospective inference and evaluate the utility of each data stream in the context of short-term forecasting. Finally, we focus on healthcare demand forecasting during epidemic surges of pathogen variants capable of immune escape. We build upon our Bayesian compartmental model to incorporate time series of cases, hospitalizations, ICU admissions, deaths, and genetic sequence counts. We show that using genetic information leads to superior forecasting performance compared to traditional models. Throughout each project, we employ our methods to analyze a variety of COVID-19 data sets at the county, state, and national levels.